3  Multivariate Viz

Use this file for practice with the multivariate viz in-class activity. Refer to the class website for details.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Import data
weather <- read.csv("https://mac-stat.github.io/data/weather_3_locations.csv") |> 
  mutate(date = as.Date(date))  

# Check out the first 6 rows
# What are the units of observation?
head(weather)
        date   location mintemp maxtemp rainfall evaporation sunshine
1 2020-01-01 Wollongong    17.1    23.1        0          NA       NA
2 2020-01-02 Wollongong    17.7    24.2        0          NA       NA
3 2020-01-03 Wollongong    19.7    26.8        0          NA       NA
4 2020-01-04 Wollongong    20.4    35.5        0          NA       NA
5 2020-01-05 Wollongong    19.8    21.4        0          NA       NA
6 2020-01-06 Wollongong    18.3    22.9        0          NA       NA
  windgustdir windgustspeed winddir9am winddir3pm windspeed9am windspeed3pm
1         SSW            39        SSW        SSE           20           15
2         SSW            37          S        ENE           13           15
3          NE            41        NNW        NNE            7           17
4         SSW            78         NE        NNE           15           17
5         SSW            57        SSW          S           31           35
6          NE            35        ESE         NE           17           20
  humidity9am humidity3pm pressure9am pressure3pm cloud9am cloud3pm temp9am
1          69          64      1014.9      1014.0        8        1    19.1
2          72          54      1020.1      1017.7        7        1    19.8
3          72          71      1017.5      1013.0        6       NA    23.4
4          77          69      1008.8      1003.9       NA       NA    24.5
5          70          75      1018.9      1019.9       NA        7    20.7
6          71          71      1021.2      1018.2       NA       NA    20.9
  temp3pm raintoday risk_mm raintomorrow
1    22.9        No     0.0           No
2    23.6        No     0.0           No
3    25.7        No     0.0           No
4    26.7        No     0.0           No
5    20.0        No     0.0           No
6    22.6        No     0.8           No
# How many data points do we have?
dim(weather)
[1] 2367   24
nrow(weather)
[1] 2367
# What type of variables do we have?
str(weather)
'data.frame':   2367 obs. of  24 variables:
 $ date         : Date, format: "2020-01-01" "2020-01-02" ...
 $ location     : chr  "Wollongong" "Wollongong" "Wollongong" "Wollongong" ...
 $ mintemp      : num  17.1 17.7 19.7 20.4 19.8 18.3 19.9 20.1 19.8 20.5 ...
 $ maxtemp      : num  23.1 24.2 26.8 35.5 21.4 22.9 25.6 23.2 23.1 25.4 ...
 $ rainfall     : num  0 0 0 0 0 0 0.8 1.6 0 0 ...
 $ evaporation  : num  NA NA NA NA NA NA NA NA NA NA ...
 $ sunshine     : num  NA NA NA NA NA NA NA NA NA NA ...
 $ windgustdir  : chr  "SSW" "SSW" "NE" "SSW" ...
 $ windgustspeed: int  39 37 41 78 57 35 44 41 39 56 ...
 $ winddir9am   : chr  "SSW" "S" "NNW" "NE" ...
 $ winddir3pm   : chr  "SSE" "ENE" "NNE" "NNE" ...
 $ windspeed9am : int  20 13 7 15 31 17 30 31 24 19 ...
 $ windspeed3pm : int  15 15 17 17 35 20 7 33 26 39 ...
 $ humidity9am  : int  69 72 72 77 70 71 76 77 76 79 ...
 $ humidity3pm  : int  64 54 71 69 75 71 72 76 79 76 ...
 $ pressure9am  : num  1015 1020 1018 1009 1019 ...
 $ pressure3pm  : num  1014 1018 1013 1004 1020 ...
 $ cloud9am     : int  8 7 6 NA NA NA NA 8 NA NA ...
 $ cloud3pm     : int  1 1 NA NA 7 NA NA NA NA NA ...
 $ temp9am      : num  19.1 19.8 23.4 24.5 20.7 20.9 22.9 21.3 21.2 23 ...
 $ temp3pm      : num  22.9 23.6 25.7 26.7 20 22.6 24.9 22.2 22.2 25.1 ...
 $ raintoday    : chr  "No" "No" "No" "No" ...
 $ risk_mm      : num  0 0 0 0 0 0.8 1.6 0 0 1 ...
 $ raintomorrow : chr  "No" "No" "No" "No" ...
ggplot(weather, aes(x = temp3pm)) +
  geom_histogram(color = "white")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 19 rows containing non-finite outside the scale range
(`stat_bin()`).

# Plot 1 (no facets & starting from a density plot of temp3pm)
ggplot(weather, aes(x = temp3pm)) + 
  geom_density()
Warning: Removed 19 rows containing non-finite outside the scale range
(`stat_density()`).

# Plot 2 (no facets or densities)
ggplot(weather, aes(x = temp3pm, fill = location)) +
  geom_histogram(color = "white") 
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 19 rows containing non-finite outside the scale range
(`stat_bin()`).

# Plot 3 (facets)
ggplot(weather, aes(x = temp3pm)) +
  geom_density() + 
  facet_wrap(~ location)
Warning: Removed 19 rows containing non-finite outside the scale range
(`stat_density()`).

# Don't worry about the syntax (we'll learn it soon)
woll <- weather |>
  filter(location == "Wollongong") |> 
  mutate(date = as.Date(date))  
# How often does it raintoday?
# Fill your geometric layer with the color blue.
# If it does raintoday, what does this tell us about raintomorrow?
# Use your intuition first
ggplot(woll, aes(x = raintoday, fill = raintomorrow)) + 
  geom_bar()

# Side-by-side bars
ggplot(woll, aes(x = raintoday, fill = raintomorrow)) + 
  geom_bar(position = "dodge")

# Proportional bars
# position = "fill" refers to filling the frame, nothing to do with the color-related fill
ggplot(woll, aes(x = raintoday, fill = raintomorrow)) + 
  geom_bar(position = "fill")

# Plot temp3pm vs temp9am
# Change the code in order to indicate the location to which each data point corresponds
ggplot(weather, aes(y = temp3pm, x = temp9am)) + 
  geom_text(aes(label = location)) 
Warning: Removed 27 rows containing missing values or values outside the scale range
(`geom_text()`).

# Change the code in order to indicate the location to which each data point corresponds
# AND identify the days on which it rained / didn't raintoday
ggplot(weather, aes(y = temp3pm, x = temp9am, color = location, shape = raintoday)) + 
  geom_point() 
Warning: Removed 69 rows containing missing values or values outside the scale range
(`geom_point()`).

# Change the code in order to construct a line plot of temp3pm vs date for each separate location (no points!)
ggplot(weather, aes(y = temp3pm, x = date)) + 
  geom_line() + 
  facet_wrap(~location)

# Plot the relationship of raintomorrow & raintoday
# Change the code in order to indicate this relationship by location
ggplot(weather, aes(x = raintoday, fill = raintomorrow)) + 
  geom_bar(position = "fill") + 
  facet_wrap(~location)

4 In Class Exercises

# Import and check out data
education <- read.csv("https://mac-stat.github.io/data/sat.csv")
head(education)
       State expend ratio salary frac verbal math  sat  fracCat
1    Alabama  4.405  17.2 31.144    8    491  538 1029   (0,15]
2     Alaska  8.963  17.6 47.951   47    445  489  934 (45,100]
3    Arizona  4.778  19.3 32.175   27    448  496  944  (15,45]
4   Arkansas  4.459  17.1 28.934    6    482  523 1005   (0,15]
5 California  4.992  24.0 41.078   45    417  485  902  (15,45]
6   Colorado  5.443  18.4 34.571   29    462  518  980  (15,45]

5 Exercise 1: SAT scores

ggplot(education, aes(x = sat)) +
  geom_histogram(color = "white") 
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(education, aes(x = sat)) + 
  geom_density()

The distribution of SAT scores appears a bit bimodal, operating on the range of potential scores from a little under 800 to close to 1200. The peak appears at around 900, then there is a bit of a dip and then another set of peaks around 1050.

6 Exercise 2:

# Construct a plot of sat vs expend
# Include a "best fit linear regression model" (HINT: method = "lm")
ggplot(education, aes(y = sat, x = expend)) + 
  geom_point() + 
  geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'

# Construct a plot of sat vs salary
# Include a "best fit linear regression model" (HINT: method = "lm")

ggplot(education, aes(y = sat, x = salary)) + 
  geom_point() + 
  geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'

These relationships are both very interesting, as they are negative. As salary and state expenditure both increase, the negative trend shows that SAT scores fall.

7 Exercise 3:

ggplot(education, aes(y = sat, x = salary, color = expend)) + 
  geom_point (alpha = 0.75) + 
  geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'
Warning: The following aesthetics were dropped during statistical transformation:
colour.
ℹ This can happen when ggplot fails to infer the correct grouping structure in
  the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
  variable into a factor?

8 Exercise 4:

ggplot(education, aes(y = sat, x = salary, color = cut(expend, 3))) + 
  geom_point() + 
  geom_smooth(se = FALSE, method = "lm")
`geom_smooth()` using formula = 'y ~ x'

It appears that as salary increases for the lowest subsection of expendature sat scores see a negative relationship that is very steep.Then as salary increases and expend is in the middle group there is still a negative relationship seen with SAT scores but it is weaker. Lastly, the highest level of salaries sees the highest levels of expendature and there is a weak positive relationship with SAT scores.

9 Exercise 5:

Part a

ggplot(education, aes(x = fracCat)) + 
  geom_bar() 

Part b

ggplot(education, aes(x = sat, fill = fracCat)) +
  geom_density( alpha = 0.5) + 
  theme_minimal()

ggplot(education, aes(y = sat, x = fracCat)) +
  geom_boxplot()

Part c

ggplot(education, aes(y = sat, x = expend, color = fracCat)) +
  geom_point(alpha = 0.75) +
  geom_smooth (method = lm)
`geom_smooth()` using formula = 'y ~ x'

This does not suggest that SAT scores decrease as spending increases and actually suggests the opposite. We see that for each of the three groups of states, as their spending increases so too do their SAT scores.

Part d Maybe because there are three such distinct groups and the relationship heavily depends on this variable. For expendatures it matters per student. With the addition of these other variables we are able to separate data points out which allows for the true relationships to be found.

North Carolina appears to be an outlier in regards to the ratio variable. Arkansas and Mississippi seem to be similar and so do New Jersey and Maryland. States that are geographically similar appear to be similar in education trends. Putting this all together you can see what the overall trends are across the US and where some of the biggest differences lie.